Today, we will implement our own convolution algorithm and convolution layer, to really see what's going on. I'll assume you already know what convolutions are and how CNNs work. To brush up on the ideas real quick, check out these two videos from Udacity's "Deep Learning Nanodegree program":
from IPython.lib.display import YouTubeVideo
YouTubeVideo('RnM1D-XI--8', width=800, height=450)
YouTubeVideo('yfPEROi3SPU', width=800, height=450)
Let's first see what state-of-the-art convolution performance looks like:
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import torch.nn as nn
import torch.nn.functional as F
%load_ext memory_profiler
def openImg(url):
    # Load an image file and return it as a (#channels, height, width) tensor
    return torch.Tensor(np.asarray(Image.open(url))).permute(2, 0, 1)
This is to demonstrate that the Image.open() call used above works:
rick = Image.open('rick_sanchez.jpg'); rick
And here's the image as a tensor:
rick = openImg('rick_sanchez.jpg').cuda(); rick.shape
As you can see, the dimensions we will be working with are (#channels, height, width). Let's also define some kernels:
identity = torch.Tensor([[0, 0, 0],
                         [0, 1, 0],
                         [0, 0, 0]]).cuda()
emboss = torch.Tensor([[-2, -1, 0],
                       [-1, 1, 1],
                       [0, 1, 2]]).cuda()
sharpen = torch.Tensor([[0, -1, 0],
                        [-1, 5, -1],
                        [0, -1, 0]]).cuda()
edgeDetect = torch.Tensor([[0, 1, 0],
                           [1, -4, 1],
                           [0, 1, 0]]).cuda()
From the documentation, F.conv2d() expects the image to have dimension (#samples, #input channels, image height, image width) and the kernel to have dimension (#output channels, #input channels, kernel height, kernel width):
We're ignoring the groups argument, to make things simpler. Let's test it out:
rick1 = rick.unsqueeze(0).cuda(); print(f"Image dimension: {rick1.shape}")
kernel = torch.cuda.FloatTensor(3, 3, 3, 3).fill_(0)
kernel[0, 0] = emboss # in red to out red channel
kernel[1, 1] = emboss # in green to out green channel
kernel[2, 2] = emboss # in blue to out blue channel
%time transformed = F.conv2d(rick1, kernel)
print(f"Transformed image dimension: {transformed.shape}")
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(transformed[0].permute(1, 2, 0).cpu()/256); pass
Seems about right. Also note the performance: it usually takes 400$\mu s$ to 1200$\mu s$. That's super impressive.
Let's strip away all the details and just consider a 3 channel image, with a 2 dimensional kernel. Let's try to do this using vanilla Python:
rick1 = rick.cpu(); print(f"Image dimension: {rick1.shape}")
kernel = emboss
# expects image to be (#channels, height, width), and kernel to have 2 dimensions
def conv(image, kernel):
    _, height, width = image.shape
    transformed = torch.zeros(3, height-2, width-2)
    for pixelY in range(0, height-2):
        print(f"\rProgress: {np.round(10000*pixelY/height)/100.0}% ", end="")
        for pixelX in range(0, width-2):
            red, green, blue = 0, 0, 0
            for kernelX in range(3):
                for kernelY in range(3):
                    red += kernel[kernelY, kernelX] * image[0, pixelY+kernelY, pixelX+kernelX]
                    green += kernel[kernelY, kernelX] * image[1, pixelY+kernelY, pixelX+kernelX]
                    blue += kernel[kernelY, kernelX] * image[2, pixelY+kernelY, pixelX+kernelX]
            transformed[0, pixelY, pixelX] = red
            transformed[1, pixelY, pixelX] = green
            transformed[2, pixelY, pixelX] = blue
    print()
    return transformed
%time transformed = conv(rick1, kernel)
print(f"Transformed image dimension: {transformed.shape}")
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(transformed.permute(1, 2, 0)/256)
Okay, that's sort of unbearable. 20 minutes and 30 seconds, or 1230 seconds: more than 2.4 million times slower. This is completely unacceptable. Why does this happen? There are a couple of reasons.
First of all, it is written in Python, a dynamic, interpreted language. Variables don't carry explicit type information; the programmer can just loosely tell Python what to do and it will do it. This also means the interpreter spends a lot of time decoding the code on the fly, and it has to check the type of each variable before doing anything with it.
Second, Python has something called the "Global Interpreter Lock" that ensures only one thread runs Python code at any given point. My computer has 8 cores, but vanilla Python can't really utilize more than one. Compare that to the convolution function provided by PyTorch, which can utilize something like 600 cores on a GPU, all running very optimized compiled C code, sometimes even a little assembly; vanilla Python is no match.
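As a quick illustration of that interpreter overhead (a toy sketch, with a made-up 200x200 array size), compare a pure-Python element-wise loop against a single vectorized tensor operation:

```python
import time
import torch

x = torch.randn(200, 200)

# Pure Python: one interpreted loop iteration (plus type checks) per element
t0 = time.time()
y = torch.empty_like(x)
for i in range(200):
    for j in range(200):
        y[i, j] = x[i, j] + 1
loop_time = time.time() - t0

# Vectorized: one call into compiled C code for all 40,000 elements
t0 = time.time()
z = x + 1
vec_time = time.time() - t0

print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.6f}s")
```

Even on a CPU, the loop version is typically orders of magnitude slower; the gap only widens once a GPU enters the picture.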
So it seems like we can't really do this, right? We can't write C code in Python, and even if we could, it would be so hard to make it work that it wouldn't be worth the extra insight we'd gain.
Actually, we can. I have managed to get within 2 times of the state-of-the-art speed using about 10 lines of Python. The general idea is to convert the vanilla Python solution into tensors and operations on them. Let's see how to do that.
Let's set things up first:
rick1 = rick[0]; print(f"Image dimension (red channel): {rick1.shape}")
kernel = emboss; print(f"Kernel dimension: {kernel.shape}")
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(rick1.cpu()); rick1.mean() # to prove the colors are from 0-255
How are convolutions calculated? If we focus on one channel at a time, then for each output pixel we multiply the kernel with the pixels surrounding it and add the products together. In the vanilla implementation, we loop through every pixel, and then through all the kernel items. The "loop through every pixel" part looks like a good target for tensor operations, because there's a lot more looping over pixels than over the kernel. So a possible solution is to multiply the whole image by one kernel item on the GPU, then repeat this for every kernel item with a simple loop. What will this look like?
Let's define 9 images from a11 to a33, each is the base image multiplied with a kernel item:
print(f"Kernel: \n{kernel}")
a11 = rick1 * kernel[0, 0]
a12 = rick1 * kernel[0, 1]
a13 = rick1 * kernel[0, 2]
a21 = rick1 * kernel[1, 0]
a22 = rick1 * kernel[1, 1]
a23 = rick1 * kernel[1, 2]
a31 = rick1 * kernel[2, 0]
a32 = rick1 * kernel[2, 1]
a33 = rick1 * kernel[2, 2]; print(f"a33's shape: {a33.shape}")
What happens if we just add together all of these images? Should it work? Let's see:
a = a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33
a.mean()
The mean looks reasonable. Let's display it:
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(a.cpu())
It doesn't look like the emboss effect was applied. Let's do it again, but with edge detect, to make sure:
rick1 = rick[0]; kernel = edgeDetect
a = rick1 * kernel[0, 0] + rick1 * kernel[0, 1] + rick1 * kernel[0, 2] + rick1 * kernel[1, 0] + rick1 * kernel[1, 1] + rick1 * kernel[1, 2] + rick1 * kernel[2, 0] + rick1 * kernel[2, 1] + rick1 * kernel[2, 2]
plt.imshow(a.cpu())
a.mean()
Yep, something is off here. No edge detection whatsoever, and the mean pixel value is 0. What are we doing in the operation above? We're multiplying the image by each number in the kernel, then adding the results together. That's the same as summing all the items in the kernel first and then multiplying the image by that sum. The sum of emboss is 1, and the sum of edge detect is 0, so the results make sense.
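A quick sanity check of that claim, on a toy 4x4 image:

```python
import torch

img = torch.arange(16.).reshape(4, 4)  # toy "image"
kernel = torch.tensor([[0., 1., 0.],
                       [1., -4., 1.],
                       [0., 1., 0.]])  # edge detect, sums to 0

# Adding the nine un-shifted products is the same as scaling by the kernel's sum
a = sum(img * kernel[i, j] for i in range(3) for j in range(3))
print(torch.allclose(a, img * kernel.sum()))  # True, and kernel.sum() is 0 here
```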
Instead, we have to shift each image a little bit before adding them all together. We can do this using the roll() function:
a = torch.Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); display(a)
display(a.roll(1, 0))
display(a.roll((1, 1), (0, 1)))
You can see what it does to the tensor. The first argument is a tuple of integer shifts, and the second argument is a tuple of the dimensions to shift along.
rick1 = rick[0]; print(f"Image dimension (red channel): {rick1.shape}")
kernel = emboss; print(f"Kernel dimension: {kernel.shape}")
a11 = (rick1 * kernel[0, 0]).roll((1, 1), (1, 0))
a12 = (rick1 * kernel[0, 1]).roll((0, 1), (1, 0))
a13 = (rick1 * kernel[0, 2]).roll((-1, 1), (1, 0))
a21 = (rick1 * kernel[1, 0]).roll((1, 0), (1, 0))
a22 = (rick1 * kernel[1, 1]).roll((0, 0), (1, 0))
a23 = (rick1 * kernel[1, 2]).roll((-1, 0), (1, 0))
a31 = (rick1 * kernel[2, 0]).roll((1, -1), (1, 0))
a32 = (rick1 * kernel[2, 1]).roll((0, -1), (1, 0))
a33 = (rick1 * kernel[2, 2]).roll((-1, -1), (1, 0))
a = a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(a.cpu())
a.mean()
Seems correct. How about the version where we have 3 channels?
rick1 = rick; print(f"Image dimension: {rick1.shape}")
kernel = emboss; print(f"Kernel dimension: {kernel.shape}")
a11 = (rick1 * kernel[0, 0]).roll((1, 1), (2, 1))
a12 = (rick1 * kernel[0, 1]).roll((0, 1), (2, 1))
a13 = (rick1 * kernel[0, 2]).roll((-1, 1), (2, 1))
a21 = (rick1 * kernel[1, 0]).roll((1, 0), (2, 1))
a22 = (rick1 * kernel[1, 1]).roll((0, 0), (2, 1))
a23 = (rick1 * kernel[1, 2]).roll((-1, 0), (2, 1))
a31 = (rick1 * kernel[2, 0]).roll((1, -1), (2, 1))
a32 = (rick1 * kernel[2, 1]).roll((0, -1), (2, 1))
a33 = (rick1 * kernel[2, 2]).roll((-1, -1), (2, 1))
a = a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(a.permute(1, 2, 0).cpu()/256); pass
Nice. Let's bundle it into a function and measure its performance.
rick1 = rick; print(f"Image dimension: {rick1.shape}")
kernel = emboss; print(f"Kernel dimension: {kernel.shape}")
def conv2(img, kernel):
    a11 = (img * kernel[0, 0]).roll((1, 1), (2, 1))
    a12 = (img * kernel[0, 1]).roll((0, 1), (2, 1))
    a13 = (img * kernel[0, 2]).roll((-1, 1), (2, 1))
    a21 = (img * kernel[1, 0]).roll((1, 0), (2, 1))
    a22 = (img * kernel[1, 1]).roll((0, 0), (2, 1))
    a23 = (img * kernel[1, 2]).roll((-1, 0), (2, 1))
    a31 = (img * kernel[2, 0]).roll((1, -1), (2, 1))
    a32 = (img * kernel[2, 1]).roll((0, -1), (2, 1))
    a33 = (img * kernel[2, 2]).roll((-1, -1), (2, 1))
    return a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33
%time transformed = conv2(rick1, kernel)
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(transformed.permute(1, 2, 0).cpu()/256); pass
Holy shit, did you see that? Now it only takes 838$\mu s$. This is very close to the state of the art.
Now let's bring this over to F.conv2d()'s conventions. The function below takes a (3, 3) kernel and spits out a (3, 3, 3, 3) kernel, like the example way back at the beginning:
def advancedKernel(inKernel):
    kernel = torch.cuda.FloatTensor(3, 3, 3, 3).fill_(0)
    kernel[0, 0] = inKernel # in red to out red channel
    kernel[1, 1] = inKernel # in green to out green channel
    kernel[2, 2] = inKernel # in blue to out blue channel
    return kernel
emboss1 = advancedKernel(emboss)
print(f"Emboss shape: {emboss1.shape}")
emboss1
Now to the image:
rick1 = rick.unsqueeze(0); print(f"Image dimension: {rick1.shape}")
Now on to the function. It looks similar to the conv2() function we defined above, but there are more moving parts:
def conv3(img, kernel):
    img = img.permute(0, 2, 3, 1)
    kernel = kernel.permute(2, 3, 1, 0)
    a11 = (img @ kernel[0, 0]).roll((1, 1), (2, 1))
    a12 = (img @ kernel[0, 1]).roll((0, 1), (2, 1))
    a13 = (img @ kernel[0, 2]).roll((-1, 1), (2, 1))
    a21 = (img @ kernel[1, 0]).roll((1, 0), (2, 1))
    a22 = (img @ kernel[1, 1]).roll((0, 0), (2, 1))
    a23 = (img @ kernel[1, 2]).roll((-1, 0), (2, 1))
    a31 = (img @ kernel[2, 0]).roll((1, -1), (2, 1))
    a32 = (img @ kernel[2, 1]).roll((0, -1), (2, 1))
    a33 = (img @ kernel[2, 2]).roll((-1, -1), (2, 1))
    return (a11 + a12 + a13 + a21 + a22 + a23 + a31 + a32 + a33).permute(0, 3, 1, 2)
kernel1 = advancedKernel(edgeDetect)
%time transformed = conv3(rick1, kernel1)
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow((transformed[0].permute(1, 2, 0)/256).cpu())
Seems like it's working fine. Let's unpack what's going on here. The image's shape is (#samples, #channels, iHeight, iWidth), with the prefix "i" standing for "image". The kernel's shape is (#out channels, #in channels, kHeight, kWidth), with the prefix "k" standing for "kernel".
Let's think about the core operation we did in function conv2(): img * kernel[0, 0]
The "pick out a specific kernel item" part indexes the kHeight and kWidth dimensions. So in conv3(), we want kernel[0, 0] to be a tensor with shape (#in channels, #out channels). This looks a lot like an ordinary matrix operation: normally, an (a, b) weight tensor "translates" samples with a features into samples with b features. Here, the input image has a channels and the output image has b channels, so each kernel item really is a simple linear transformation between channels. But in the original kernel layout the channel dimensions come first, and in the reverse order, so we have to move them to the back and swap them using permute. We also still want to pick out a 2-dimensional kernel item with 2 leading indexes. Therefore, the correct permutation is kernel.permute(2, 3, 1, 0).
Similarly, now that the input and output channels are distinct, we have to build that into the image tensor itself. It's convenient to have the channel dimension last, so we can use a normal matrix multiplication between the image and the kernel item. So the correct image permutation is img = img.permute(0, 2, 3, 1). After that, we can roll as before and everything works out.
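To see how the shapes line up after those permutes, here's a toy example (the sizes are made up for illustration): a batch of 1 image with 3 input channels, and a kernel mapping 3 input channels to 5 output channels:

```python
import torch

img = torch.randn(1, 3, 4, 4).permute(0, 2, 3, 1)     # (#samples, iH, iW, #in)  -> (1, 4, 4, 3)
kernel = torch.randn(5, 3, 3, 3).permute(2, 3, 1, 0)  # (kH, kW, #in, #out)      -> (3, 3, 3, 5)

# kernel[0, 0] is an (#in, #out) = (3, 5) matrix, so @ transforms the channel axis
out = img @ kernel[0, 0]
print(out.shape)  # torch.Size([1, 4, 4, 5])
```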
And the performance is not bad at all: 782$\mu s$. Surprisingly, this is pretty much the same as our conv2() function above, even though the image and kernel are larger and more complex, and there's a matrix multiplication instead of a simple element-wise multiplication. I still don't understand why this is the case. All in all, I think this is a success. Now let's try to make the convolution more general.
def conv4(imgs, kernel):
    imgs = imgs.permute(0, 2, 3, 1)
    kernel = kernel.permute(2, 3, 1, 0)
    ks, inChannels, outChannels = kernel.shape[0], kernel.shape[2], kernel.shape[3]
    padding = ks - 1
    samples, height, width, _ = imgs.shape
    transformed = torch.cuda.FloatTensor(samples, height-padding, width-padding, outChannels).zero_()
    for yKernel in range(ks):
        for xKernel in range(ks):
            transformed += (imgs @ kernel[yKernel, xKernel])[:, yKernel:height-padding+yKernel, xKernel:width-padding+xKernel, :]
    return transformed.permute(0, 3, 1, 2)
Here, the idea is similar, but with a little more complexity to accommodate a changing kernel size. Also, we're no longer using the roll() function; instead, we sample smaller images directly from the tensor.
kernel1 = advancedKernel(emboss)
rick1 = rick.unsqueeze(0)
%time transformed = conv4(rick1, kernel1)
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.imshow(transformed[0].permute(1, 2, 0).cpu()/256); pass
The performance here is even better, at 769$\mu s$, and I still don't fully understand why.
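As a sanity check, a CPU sketch of conv4() (the same logic, just allocating with torch.zeros instead of torch.cuda.FloatTensor, and with made-up input sizes) should match F.conv2d() up to floating-point accumulation order:

```python
import torch
import torch.nn.functional as F

def conv4_cpu(imgs, kernel):
    # CPU version of conv4() above, same logic
    imgs = imgs.permute(0, 2, 3, 1)
    kernel = kernel.permute(2, 3, 1, 0)
    ks, outChannels = kernel.shape[0], kernel.shape[3]
    padding = ks - 1
    samples, height, width, _ = imgs.shape
    transformed = torch.zeros(samples, height - padding, width - padding, outChannels)
    for yKernel in range(ks):
        for xKernel in range(ks):
            transformed += (imgs @ kernel[yKernel, xKernel])[:, yKernel:height-padding+yKernel, xKernel:width-padding+xKernel, :]
    return transformed.permute(0, 3, 1, 2)

imgs = torch.randn(2, 3, 8, 8)    # (#samples, #in channels, height, width)
kernel = torch.randn(4, 3, 3, 3)  # (#out channels, #in channels, kH, kW)
ours = conv4_cpu(imgs, kernel)
reference = F.conv2d(imgs, kernel)
print(torch.allclose(ours, reference, atol=1e-4))  # True
```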
The idea here is simple enough. We already have the convolution function, ready to accept a kernel, so we can just create kernels with the correct size. Let's set everything up first, so we know where we're going with this:
import torchvision.datasets as datasets
import torchvision
import torch.optim as optim
import time
Here are our data loaders. I'm too lazy to create my own custom dataset class, so I'm just gonna use PyTorch's default datasets:
dl = torch.utils.data.DataLoader(datasets.MNIST("datasets/", download=True,
        transform=torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize((0.1307,), (0.3081,))
        ])), batch_size=1000, shuffle=True)
dl_test = torch.utils.data.DataLoader(datasets.MNIST("datasets/", download=True, train=False,
        transform=torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize((0.1307,), (0.3081,))
        ])), batch_size=1000, shuffle=True)
Then here's our network. Pretty straightforward; all components are standard for a CNN. Granted, you don't need a CNN for MNIST (it's a bit overkill), but we're still gonna build one to test out our convolution module. Also note that we can quite easily slip in a different convolution module:
class Net(nn.Module):
    def __init__(self, convModule):
        super().__init__()
        self.conv1 = convModule(1, 3, 3)
        self.conv2 = convModule(3, 3, 3)
        self.conv3 = convModule(3, 6, 3)
        self.conv4 = convModule(6, 6, 3)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.logSoftmax = nn.LogSoftmax(1)
        self.fc1 = nn.Linear(96, 30)
        self.fc2 = nn.Linear(30, 10)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.pool(x)
        x = self.relu(self.conv3(x))
        x = self.relu(self.conv4(x))
        x = self.pool(x)
        x = self.dropout(x.contiguous().view(-1, 6 * 4 * 4))
        x = self.dropout(self.relu(self.fc1(x)))
        return self.logSoftmax(self.fc2(x))
This is the convolution module. As you can see, we initialize a kernel, and a bias for each output channel. The initialization is involved, with a very specific scheme, but the actual calculation is basically just conv4() above:
class Conv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        kSqrt = 1/np.sqrt(in_channels*kernel_size**2)
        self.kernel = nn.Parameter(torch.rand(out_channels, in_channels, kernel_size, kernel_size)*2*kSqrt-kSqrt)
        self.bias = nn.Parameter(torch.rand(out_channels).unsqueeze(-1).unsqueeze(-1)*2*kSqrt-kSqrt)

    def forward(self, imgs):
        return conv4(imgs, self.kernel) + self.bias
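One subtle point: the bias is stored with shape (#out channels, 1, 1) so that it broadcasts over height and width when added to conv4()'s (#samples, #out channels, height, width) output. A toy illustration of that broadcast (with made-up sizes):

```python
import torch

out = torch.zeros(2, 3, 4, 4)  # (#samples, #out channels, height, width)
bias = torch.tensor([1., 2., 3.]).unsqueeze(-1).unsqueeze(-1)  # (3,) -> (3, 1, 1)

# bias broadcasts across samples, height and width: each channel gets its own offset
res = out + bias
print(res.shape, res[0, 1, 2, 3].item())  # torch.Size([2, 3, 4, 4]) 2.0
```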
And here's our training function. Again, pretty straightforward. It returns data we can graph, to visualize the performance better:
def train(net, dl=dl, dl_test=dl_test):
    lossFunction = nn.NLLLoss()
    optimizer = optim.Adam(net.parameters(), lr=0.001)
    losses = []
    accuracies = []
    times = []
    count = 0
    initialTime = time.time()
    for epoch in range(5):
        for imgs, labels in dl:
            count += 1
            optimizer.zero_grad()
            imgs, labels = imgs.cuda(), labels.cuda()
            output = net(imgs)
            loss = lossFunction(output, labels)
            if count % 10 == 0:
                test_imgs, test_labels = next(iter(dl_test))
                accuracies.append((torch.argmax(net(test_imgs.cuda()), dim=1) == test_labels.cuda()).sum().item())
                times.append(int(time.time()-initialTime))
                losses.append(loss.item())
                print(f"\rLoss: {losses[-1]}, accuracy: {accuracies[-1]}/1000 ", end="")
            loss.backward()
            optimizer.step()
    return times, losses, accuracies
And here's actually training it:
times1, losses1, acc1 = train(Net(nn.Conv2d).cuda())
times2, losses2, acc2 = train(Net(Conv2d).cuda())
Let's see how they fare against each other:
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.plot(times1, np.array(acc1)/10, "r")
plt.plot(times2, np.array(acc2)/10, "b")
plt.legend(["Built in", "From scratch"])
plt.xlabel("Time (s)")
plt.ylabel("% correct")
plt.show()
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.plot(times1, losses1, "r")
plt.plot(times2, losses2, "b")
plt.legend(["Built in", "From scratch"])
plt.xlabel("Time (s)")
plt.ylabel("Loss"); pass
What if they executed at the same speed? Would there be any accuracy degradation?
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.plot(times1, np.array(acc1)/10, "r")
plt.plot(times1, np.array(acc2)/10, "b")
plt.legend(["Built in", "From scratch"])
plt.xlabel("Scaled time (s)")
plt.ylabel("% correct")
plt.figure(num=None, figsize=(10, 6), dpi=350)
plt.plot(times1, losses1, "r")
plt.plot(times1, losses2, "b")
plt.legend(["Built in", "From scratch"])
plt.xlabel("Scaled time (s)")
plt.ylabel("Loss"); pass
They're pretty close actually, and the one we built from scratch is even a little more accurate. I haven't tested this many times, though, so it may just be random variation.
So on average, in independent testing and in real training, our from-scratch convolution layer is around 2 times slower than the built-in one. Not too shabby, considering people have spent a lot of time thinking about this and optimizing the hell out of their systems, and convolutions are not the simplest of layers.
We still haven't dealt with custom strides, or with transposed convolutions. Custom strides seem easy, but I currently can't think of a way to do them fast, and I'm still not sure how to recreate the transposed convolution behavior, so those are topics for another day.